Feedback should be sent to goran.milovanovic@datakolektiv.com.
These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.


We will now begin to introduce a set of extremely useful statistical learning models. Their name - Generalized Linear Models (GLMs) - suggests that they are somehow related to Simple and Multiple Linear Regression models and yet go beyond them. That is correct: GLMs generalize the linear model, in which predictors and their respective coefficients produce a linear combination, by introducing link functions to solve the kinds of problems that Linear Regression cannot handle. For example, what if the problem is not to predict a continuous value of the criterion, but the outcome variable is instead a dichotomy, so that the problem becomes one of categorization? E.g. predicting the sex of a respondent given a set ${X}$ of their features? Enter Binomial Logistic Regression, the simplest GLM.
Another thing: GLMs cannot be estimated by minimizing the quadratic error, as we estimated Simple and Multiple Linear Regression in the previous session. The method used to estimate Linear Models is known as Least Squares Estimation. To fit GLMs to our data, we will introduce the concept of Likelihood in Probability Theory and learn about the Maximum Likelihood Estimate!
Let us briefly recall the assumptions of the (Multiple) Linear Regression model:
Most importantly, the outcome variable is assumed to be continuous, taking values anywhere from $-\infty$ to $+\infty$. What if we observe a set of variables that somehow describe a statistical experiment that can result in either of two discrete outcomes? For example, we observe a description of the behavior of a person, quantified in some way and organized into a set of variables, that should be used to predict the sex of that person. Or any other similar problem where the outcome can take only two values, say $0$ or $1$ (immediately recall the Binomial Distribution)?
The assumptions of the Linear Model obviously constrain its application in such cases. We now ask the following question: would it be possible to generalize, expand, or modify the Linear Model somehow so that it encompasses the categorization problem? It certainly sounds appealing to have a set of predictors, combine them in a linear fashion, and estimate the coefficients so as to be able to predict whether the outcome will turn out one way or the other.
There is a way to develop such a generalization of the Linear Model. In its simplest form it is the Binomial Logistic Regression. Binomial Logistic Regression is very similar to multiple regression, except that the outcome variable is now categorical (i.e. it is measured on a nominal scale that is a dichotomy).
### --- Setup - importing the libraries
# - suppress those annoying FutureWarning messages
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# - data
import numpy as np
import pandas as pd
# - os
import os
# - ml
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
# - visualization
import matplotlib.pyplot as plt
import seaborn as sns
# - parameters
%matplotlib inline
pd.options.mode.chained_assignment = None # default='warn'
sns.set_theme()
# - rng
rng = np.random.default_rng()
# - plots
plt.rc("figure", figsize=(8, 6))
plt.rc("font", size=14)
sns.set_theme(style='white')
# - directory tree
data_dir = os.path.join(os.getcwd(), '_data')
Let's recall the form of the Linear Model with any number of predictors:
$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k + \epsilon$$

So we have a linear combination of $k$ predictors $\boldsymbol{X}$ plus the model error term $\epsilon$ on the RHS, and the outcome variable $Y$ on the LHS.
Now we assume that $Y$ can take only two possible values, call them $0$ and $1$ for ease of discussion. We want to predict whether $Y$ will happen to be ($1$) or not ($0$) given our observations of a set of predictors $\boldsymbol{X}$. However, in Binary Logistic Regression we do not predict the value of the outcome itself, but rather the probability that the outcome will turn out $1$ or $0$ given the predictors.
In the simplest possible case, where there is only one predictor $X_1$, this is exactly what we predict in Binary Logistic Regression:
$$P(Y) = p_1 = \frac{1}{1+e^{-(\beta_0 + \beta_1X_1)}}$$

where $\beta_0$ and $\beta_1$ are the same good old linear model coefficients. As we will see, the linear coefficients receive a new interpretation in Binary Logistic Regression, one rather different from the one they have in the scope of the Linear Model.
With $k$ predictors we have:
$$P(Y) = p_1 = \frac{1}{1+e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k)}}$$

Now, the above equation looks like it fell from the clear blue sky to solve the problem. There is a clear motivation for its form, of course: imagine that instead of predicting the state of $Y$ directly we decide to predict the odds of $Y$ turning out $1$ instead of $0$:
$$odds = \frac{p_1}{1-p_1}$$

Here goes the trick: if instead of predicting the odds $p_1/(1-p_1)$ we decide to predict the log-odds (also called: the logit) from a linear combination of predictors
$$log \left( \frac{p_1}{1-p_1} \right) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k$$

it turns out that we can recover the odds by taking the exponent of both the LHS and the RHS:
$$\frac{p_1}{1-p_1} = e^{(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k)}$$

and then by following a simple algebraic rearrangement:
(1) let's write $l=\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k$ for simplicity and use $l$ to replace the whole linear combination in what follows;
(2) we then have $log \left( \frac{p_1}{1-p_1} \right)=l$;
(3) taking the exponent of both sides we arrive at $\frac{p_1}{1-p_1}=e^l$;
(4) it immediately follows that
$\frac{p_1}{1-p_1}=\frac{1}{e^{-l}} \implies p_1=(1-p_1)e^l \implies p_1+p_1e^l=e^l \implies p_1e^{-l}+p_1=1 \implies p_1(1+e^{-l})=1$
and after rewriting $l$ as a linear combination again we find that the probability $p_1$ of the outcome $Y$ turning out $1$ is:
$$P(Y) = p_1 = \frac{1}{1+e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_kX_k)}}$$

Now, imagine we set the following criterion: anytime we estimate $p_1$ to be larger than or equal to $.5$ we predict that $Y=1$, and anytime $p_1 < .5$ we predict that $Y=0$. What we need to do in order to learn how to predict $Y$ in this way is to estimate the coefficients $\beta_0$, $\beta_1$, $\beta_2$, etc., as we did in the case of the linear model. However, minimizing the SSE will not work in this case: our predictions will be on a probability scale, while our observations are discrete, $0$ or $1$. We will have to find another way.
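To see the algebra above in action, here is a minimal numerical check that the sigmoid inverts the logit (the coefficient values below are made up for illustration):

```python
import numpy as np

def sigmoid(l):
    # - p_1 = 1 / (1 + e^(-l))
    return 1 / (1 + np.exp(-l))

def logit(p):
    # - log-odds: log(p / (1 - p))
    return np.log(p / (1 - p))

# - a made-up linear combination l = beta_0 + beta_1 * x_1
l = -1.5 + 0.3 * 4.0
p1 = sigmoid(l)
# - recovering l from p_1 confirms the logit is the inverse of the sigmoid
round(logit(p1), 10) == round(l, 10)
```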
# - loading the dataset
# - GitHub: https://github.com/dijendersaini/Customer-Churn-Model/blob/master/churn_data.csv
# - place it in your _data/ directory
churn_data = pd.read_csv(os.path.join(data_dir, 'churn_data.csv'))
churn_data.head(10)
| | customerID | tenure | PhoneService | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | 1 | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | 34 | Yes | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | 2 | Yes | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | 45 | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | 2 | Yes | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
| 5 | 9305-CDSKC | 8 | Yes | Month-to-month | Yes | Electronic check | 99.65 | 820.5 | Yes |
| 6 | 1452-KIOVK | 22 | Yes | Month-to-month | Yes | Credit card (automatic) | 89.10 | 1949.4 | No |
| 7 | 6713-OKOMC | 10 | No | Month-to-month | No | Mailed check | 29.75 | 301.9 | No |
| 8 | 7892-POOKP | 28 | Yes | Month-to-month | Yes | Electronic check | 104.80 | 3046.05 | Yes |
| 9 | 6388-TABGU | 62 | Yes | One year | No | Bank transfer (automatic) | 56.15 | 3487.95 | No |
churn_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customerID        7043 non-null   object
 1   tenure            7043 non-null   int64
 2   PhoneService      7043 non-null   object
 3   Contract          7043 non-null   object
 4   PaperlessBilling  7043 non-null   object
 5   PaymentMethod     7043 non-null   object
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object
 8   Churn             7043 non-null   object
dtypes: float64(1), int64(1), object(7)
memory usage: 495.3+ KB
# - some entries have missing values given as empty strings
churn_data.loc[488]
customerID                           4472-LVYGI
tenure                                        0
PhoneService                                 No
Contract                               Two year
PaperlessBilling                            Yes
PaymentMethod         Bank transfer (automatic)
MonthlyCharges                            52.55
TotalCharges
Churn                                        No
Name: 488, dtype: object
# - use .replace method to replace empty strings with NaN values
churn_data = churn_data.replace(r'^\s*$', np.nan, regex=True)
churn_data.loc[488]
customerID                           4472-LVYGI
tenure                                        0
PhoneService                                 No
Contract                               Two year
PaperlessBilling                            Yes
PaymentMethod         Bank transfer (automatic)
MonthlyCharges                            52.55
TotalCharges                                NaN
Churn                                        No
Name: 488, dtype: object
# - we drop all the entries with missing values
churn_data = churn_data.dropna().reset_index(drop=True)
churn_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customerID        7032 non-null   object
 1   tenure            7032 non-null   int64
 2   PhoneService      7032 non-null   object
 3   Contract          7032 non-null   object
 4   PaperlessBilling  7032 non-null   object
 5   PaymentMethod     7032 non-null   object
 6   MonthlyCharges    7032 non-null   float64
 7   TotalCharges      7032 non-null   object
 8   Churn             7032 non-null   object
dtypes: float64(1), int64(1), object(7)
memory usage: 494.6+ KB
# - notice that 'TotalCharges' values are non-numeric type, but they should be
# - this is due to the empty string values that were previously present
# - we convert them to numeric type
churn_data['TotalCharges'] = churn_data['TotalCharges'].astype('float')
churn_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customerID        7032 non-null   object
 1   tenure            7032 non-null   int64
 2   PhoneService      7032 non-null   object
 3   Contract          7032 non-null   object
 4   PaperlessBilling  7032 non-null   object
 5   PaymentMethod     7032 non-null   object
 6   MonthlyCharges    7032 non-null   float64
 7   TotalCharges      7032 non-null   float64
 8   Churn             7032 non-null   object
dtypes: float64(2), int64(1), object(6)
memory usage: 494.6+ KB
We use Binomial Logistic Regression to predict the probability $p$ of a given observation $\textbf{x}$, with features $(x_1, x_2, \ldots, x_k)$, belonging to one of the two binary categories $\{0, 1\}$. We compute these probabilities via

$$p = \frac{1}{1+e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_kx_k)}},$$

where $\beta_1, \beta_2, \ldots, \beta_k$ are the model's parameters for the predictors, and $\beta_0$ is the intercept of the model.
In order to determine whether the predicted label $\hat{y}$ for a given observation $\textbf{x}$ is $1$ or $0$, we impose a decision criterion $\sigma$, a number in the $(0, 1)$ interval. If $p > \sigma$, then we assign the label $\hat{y} = 1$ to the observation $\textbf{x}$; otherwise, we take $\hat{y} = 0$. Usually, we take $\sigma = 0.5$.
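The decision rule can be sketched in a few lines (the predicted probabilities below are made up):

```python
import numpy as np

# - made-up predicted probabilities for five observations
p = np.array([0.32, 0.13, 0.48, 0.60, 0.73])
sigma = 0.5
# - assign label 1 wherever p > sigma, else 0
y_hat = (p > sigma).astype(int)
y_hat  # array([0, 0, 0, 1, 1])
```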
The model is optimized by MLE (Maximum Likelihood Estimation), and, as we will see below, the model coefficients receive an interpretation in terms of odds.
### --- Preparing the model frame
# - extracting 'Churn' and all the numerical features columns
model_frame = churn_data[['Churn', 'tenure', 'MonthlyCharges', 'TotalCharges']]
model_frame.head()
| | Churn | tenure | MonthlyCharges | TotalCharges |
|---|---|---|---|---|
| 0 | No | 1 | 29.85 | 29.85 |
| 1 | No | 34 | 56.95 | 1889.50 |
| 2 | Yes | 2 | 53.85 | 108.15 |
| 3 | No | 45 | 42.30 | 1840.75 |
| 4 | Yes | 2 | 70.70 | 151.65 |
# - encoding 'Churn' values to binary values
model_frame['Churn'] = model_frame['Churn'].apply(lambda x: int(x == 'Yes'))
model_frame.head()
| | Churn | tenure | MonthlyCharges | TotalCharges |
|---|---|---|---|---|
| 0 | 0 | 1 | 29.85 | 29.85 |
| 1 | 0 | 34 | 56.95 | 1889.50 |
| 2 | 1 | 2 | 53.85 | 108.15 |
| 3 | 0 | 45 | 42.30 | 1840.75 |
| 4 | 1 | 2 | 70.70 | 151.65 |
predictors = model_frame.columns.drop('Churn')
predictors
Index(['tenure', 'MonthlyCharges', 'TotalCharges'], dtype='object')
# --- Composing the formula of the model
# - right side of the formula
formula = ' + '.join(predictors)
# - left side of the formula
formula = 'Churn ~ ' + formula
formula
'Churn ~ tenure + MonthlyCharges + TotalCharges'
# - fitting BLR model to the data
binomial_linear_model = smf.logit(formula=formula, data=model_frame).fit()
Optimization terminated successfully.
Current function value: 0.453372
Iterations 7
binomial_linear_model.summary()
| Dep. Variable: | Churn | No. Observations: | 7032 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 7028 |
| Method: | MLE | Df Model: | 3 |
| Date: | Fri, 28 Apr 2023 | Pseudo R-squ.: | 0.2170 |
| Time: | 23:39:56 | Log-Likelihood: | -3188.1 |
| converged: | True | LL-Null: | -4071.7 |
| Covariance Type: | nonrobust | LLR p-value: | 0.000 |
| | coef | std err | z | P>|z| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -1.5988 | 0.117 | -13.628 | 0.000 | -1.829 | -1.369 |
| tenure | -0.0671 | 0.005 | -12.297 | 0.000 | -0.078 | -0.056 |
| MonthlyCharges | 0.0302 | 0.002 | 17.585 | 0.000 | 0.027 | 0.034 |
| TotalCharges | 0.0001 | 6.14e-05 | 2.361 | 0.018 | 2.47e-05 | 0.000 |
N.B. The Wald z-statistic reported in the summary should be interpreted with caution:
The reason why the Wald statistic should be used cautiously is because, when the regression coefficient is large, the standard error tends to become inflated, resulting in the Wald statistic being underestimated (see Menard, 1995). The inflation of the standard error increases the probability of rejecting a predictor as being significant when in reality it is making a significant contribution to the model (i.e. you are more likely to make a Type II error). From: Andy Field, DISCOVERING STATISTICS USING SPSS, Third Edition, Sage.
# - model's parameters
binomial_linear_model.params
Intercept        -1.598827
tenure           -0.067114
MonthlyCharges    0.030200
TotalCharges      0.000145
dtype: float64
# - exponential of the model's parameters
np.exp(binomial_linear_model.params)
Intercept         0.202133
tenure            0.935088
MonthlyCharges    1.030660
TotalCharges      1.000145
dtype: float64
The $\Delta Odds$ (Odds Ratio)

Do not forget that you have transformed your linear combination of model coefficients and predictors into a log-odds space: the logistic regression coefficient $\beta_i$ associated with a predictor $X_i$ is the expected change in the log-odds per unit increase in $X_i$.

So, by taking $e^{\beta_i}$, your coefficient now expresses the odds ratio: the multiplicative change in the odds of the outcome for a one-unit increase in $X_i$, with all other predictors held constant.

Which means that a value of $e^{\beta_i}$ greater than $1$ increases the odds of the outcome, while a value smaller than $1$ decreases them.
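To make the odds-ratio reading concrete, here is the tenure coefficient from the fitted model above (the percentage interpretation is just $(e^{\beta}-1)\times 100$):

```python
import numpy as np

# - tenure coefficient from the model summary above
beta_tenure = -0.067114
# - exponentiating gives the odds ratio per one extra month of tenure
odds_ratio = np.exp(beta_tenure)
# - ~0.935: each additional month multiplies the churn odds by ~0.935,
# - i.e. decreases them by ~6.5%, other predictors held constant
percent_change = (odds_ratio - 1) * 100
```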
# - predicting the probabilities
probabilities = binomial_linear_model.predict()
probabilities[:10]
array([0.31861382, 0.13162129, 0.47723817, 0.04417324, 0.60445587,
0.72962176, 0.47459352, 0.2095354 , 0.53216748, 0.02770232])
# - predicting binary labels, taking \sigma = 0.5
predictions = (probabilities > .5).astype('int')
predictions[:10]
array([0, 0, 0, 0, 1, 1, 0, 0, 1, 0])
# - observed vs. predicted labels
predictions_df = pd.DataFrame()
predictions_df['observation'] = model_frame['Churn']
predictions_df['prediction'] = predictions
predictions_df.head()
| | observation | prediction |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 2 | 1 | 0 |
| 3 | 0 | 0 |
| 4 | 1 | 1 |
# - accuracy of the model
accuracy = predictions_df['observation'] == predictions_df['prediction']
accuracy = np.sum(accuracy)/len(accuracy)
np.round(accuracy, 4)
0.785
### --- Model diagnostics
# - Log-likelihood of model
model_loglike = binomial_linear_model.llf
model_loglike
-3188.1130873174216
# - model deviance
residuals_deviance = binomial_linear_model.resid_dev
model_deviance = np.sum(residuals_deviance**2)
model_deviance
6376.226174634843
# - another way to compute model deviance
np.sum(residuals_deviance**2) == -2*model_loglike
True
The Akaike Information Criterion (AIC) is a statistical measure used to evaluate the goodness-of-fit of a model. It is based on the principle of parsimony, which states that simpler models should be preferred over more complex ones, all else being equal.
The AIC is defined as follows:
$$AIC = -2\ln(\mathcal{L}) + 2k $$where $\mathcal{L}$ is the model likelihood and $k$ is the number of parameters in the model.
The AIC penalizes models with more parameters, by adding a penalty term $2k$ to the log-likelihood $-2\ln(\mathcal{L})$. This penalty term is larger for models with more parameters, and hence it discourages overfitting and encourages simpler models.
The AIC can be used to compare different models and select the best one based on their AIC values. The model with the lowest AIC value is preferred, as it strikes a good balance between goodness-of-fit and simplicity.
# - Akaike Information Criterion (AIC)
binomial_linear_model.aic
6384.226174634843
# - another way to compute AIC
aic = -2*model_loglike + 2*(len(predictors) + 1)  # - k counts the intercept, too
aic
6384.226174634843
Model effect: comparison to the Null Model
# - Log-likelihood of model
model_loglike = binomial_linear_model.llf
model_loglike
-3188.1130873174216
# Value of the constant-only loglikelihood
null_loglike = binomial_linear_model.llnull
null_loglike
-4071.6775733255813
# - Comparison to the Null Model which follows the Chi-Square distribution
# - difference between the deviances of the Null Model and our model:
# - Likelihood ratio chi-squared statistic; -2*(llnull - llf)
dev_diff = binomial_linear_model.llr
dev_diff
1767.1289720163195
-2*(null_loglike - model_loglike)
1767.1289720163195
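The likelihood-ratio statistic above can be converted into a p-value against the $\chi^2$ distribution with degrees of freedom equal to the number of predictors (here 3):

```python
from scipy.stats import chi2

# - likelihood ratio statistic from the comparison above
lr_stat = 1767.1289720163195
df = 3  # - our model has three predictors more than the Null Model
# - survival function: P(chi2_df >= lr_stat); practically zero here
p_value = chi2.sf(lr_stat, df)
```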
# - exponential of the parameters and AIC of the model using only numerical predictors (a reminder)
np.exp(binomial_linear_model.params)
Intercept 0.202133 tenure 0.935088 MonthlyCharges 1.030660 TotalCharges 1.000145 dtype: float64
binomial_linear_model.aic
6384.226174634843
### --- Preparing the dataset
# - dropping the 'customerID' column
model_frame = churn_data.drop(columns='customerID')
model_frame.head()
| | tenure | PhoneService | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 34 | Yes | One year | No | Mailed check | 56.95 | 1889.50 | No |
| 2 | 2 | Yes | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 45 | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 2 | Yes | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
model_frame.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   tenure            7032 non-null   int64
 1   PhoneService      7032 non-null   object
 2   Contract          7032 non-null   object
 3   PaperlessBilling  7032 non-null   object
 4   PaymentMethod     7032 non-null   object
 5   MonthlyCharges    7032 non-null   float64
 6   TotalCharges      7032 non-null   float64
 7   Churn             7032 non-null   object
dtypes: float64(2), int64(1), object(5)
memory usage: 439.6+ KB
# - encoding 'Churn' column to binary values
model_frame['Churn'] = model_frame['Churn'].apply(lambda x: int(x == 'Yes'))
model_frame.head()
| | tenure | PhoneService | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | 0 |
| 1 | 34 | Yes | One year | No | Mailed check | 56.95 | 1889.50 | 0 |
| 2 | 2 | Yes | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | 1 |
| 3 | 45 | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | 0 |
| 4 | 2 | Yes | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | 1 |
predictors = model_frame.columns.drop('Churn')
predictors
Index(['tenure', 'PhoneService', 'Contract', 'PaperlessBilling',
'PaymentMethod', 'MonthlyCharges', 'TotalCharges'],
dtype='object')
# --- Composing the formula of the model
# - right side of the formula
formula = ' + '.join(predictors)
# - left side of the formula
formula = 'Churn ~ ' + formula
formula
'Churn ~ tenure + PhoneService + Contract + PaperlessBilling + PaymentMethod + MonthlyCharges + TotalCharges'
# - fitting BLR model to the data
binomial_linear_model = smf.logit(formula=formula, data=model_frame).fit()
Optimization terminated successfully.
Current function value: 0.424667
Iterations 8
binomial_linear_model.summary()
| Dep. Variable: | Churn | No. Observations: | 7032 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 7021 |
| Method: | MLE | Df Model: | 10 |
| Date: | Fri, 28 Apr 2023 | Pseudo R-squ.: | 0.2666 |
| Time: | 23:39:57 | Log-Likelihood: | -2986.3 |
| converged: | True | LL-Null: | -4071.7 |
| Covariance Type: | nonrobust | LLR p-value: | 0.000 |
| | coef | std err | z | P>|z| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.8446 | 0.176 | -4.794 | 0.000 | -1.190 | -0.499 |
| PhoneService[T.Yes] | -0.8419 | 0.122 | -6.926 | 0.000 | -1.080 | -0.604 |
| Contract[T.One year] | -0.9171 | 0.103 | -8.876 | 0.000 | -1.120 | -0.715 |
| Contract[T.Two year] | -1.8135 | 0.172 | -10.565 | 0.000 | -2.150 | -1.477 |
| PaperlessBilling[T.Yes] | 0.4230 | 0.073 | 5.812 | 0.000 | 0.280 | 0.566 |
| PaymentMethod[T.Credit card (automatic)] | -0.1006 | 0.112 | -0.895 | 0.371 | -0.321 | 0.120 |
| PaymentMethod[T.Electronic check] | 0.4126 | 0.093 | 4.453 | 0.000 | 0.231 | 0.594 |
| PaymentMethod[T.Mailed check] | -0.1194 | 0.113 | -1.061 | 0.289 | -0.340 | 0.101 |
| tenure | -0.0606 | 0.006 | -9.932 | 0.000 | -0.073 | -0.049 |
| MonthlyCharges | 0.0225 | 0.002 | 11.022 | 0.000 | 0.019 | 0.027 |
| TotalCharges | 0.0003 | 6.81e-05 | 4.205 | 0.000 | 0.000 | 0.000 |
# - exponentials of the new model parameters
np.exp(binomial_linear_model.params)
Intercept                                   0.429725
PhoneService[T.Yes]                         0.430907
Contract[T.One year]                        0.399695
Contract[T.Two year]                        0.163084
PaperlessBilling[T.Yes]                     1.526520
PaymentMethod[T.Credit card (automatic)]    0.904265
PaymentMethod[T.Electronic check]           1.510691
PaymentMethod[T.Mailed check]               0.887475
tenure                                      0.941170
MonthlyCharges                              1.022802
TotalCharges                                1.000286
dtype: float64
# - AIC of the new model
binomial_linear_model.aic
5994.512151127206
# - import scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
### --- Preparing the variables
# - feature matrix
X = churn_data[['tenure', 'MonthlyCharges', 'TotalCharges']].values
# - target variable
y = churn_data['Churn'].apply(lambda x: int(x == 'Yes'))
## --- Fitting the logistic model to the numerical data
log_reg = LogisticRegression()
log_reg.fit(X, y)
LogisticRegression()
# - coefficients of the model
log_reg.coef_, log_reg.intercept_
(array([[-0.06711264, 0.03019993, 0.00014507]]), array([-1.59884834]))
# - exponentials of the model coefficients
np.exp(log_reg.coef_), np.exp(log_reg.intercept_)
(array([[0.93508987, 1.03066058, 1.00014508]]), array([0.20212917]))
# - model's accuracy
round(log_reg.score(X, y), 4)
0.785
# - confusion matrix for the given data
y_pred = log_reg.predict(X)
confusion_matrix(y, y_pred)
array([[4693, 470],
[1042, 827]])
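The accuracy reported earlier can be recovered from this confusion matrix, since the correct predictions sit on its diagonal:

```python
import numpy as np

# - confusion matrix from the cell above
cm = np.array([[4693, 470],
               [1042, 827]])
# - accuracy = (true negatives + true positives) / all observations
accuracy = np.trace(cm) / cm.sum()
round(accuracy, 4)  # 0.785
```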
churn_data.head()
| | customerID | tenure | PhoneService | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | 1 | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | 34 | Yes | One year | No | Mailed check | 56.95 | 1889.50 | No |
| 2 | 3668-QPYBK | 2 | Yes | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | 45 | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | 2 | Yes | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
### --- Composing the feature matrix
# - dropping all the non-numerical and non-binary categorical columns
X0 = churn_data.drop(columns=['customerID', 'Contract', 'PaymentMethod', 'Churn'])
# - encoding binary categorical features to binary values
X0['PaperlessBilling'] = X0['PaperlessBilling'].apply(lambda x: int(x == 'Yes'))
X0['PhoneService'] = X0['PhoneService'].apply(lambda x: int(x == 'Yes'))
X0.head()
| | tenure | PhoneService | PaperlessBilling | MonthlyCharges | TotalCharges |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 29.85 | 29.85 |
| 1 | 34 | 1 | 0 | 56.95 | 1889.50 |
| 2 | 2 | 1 | 1 | 53.85 | 108.15 |
| 3 | 45 | 0 | 0 | 42.30 | 1840.75 |
| 4 | 2 | 1 | 1 | 70.70 | 151.65 |
# - casting the data frame into a matrix
X0 = X0.values
X0
array([[1.0000e+00, 0.0000e+00, 1.0000e+00, 2.9850e+01, 2.9850e+01],
[3.4000e+01, 1.0000e+00, 0.0000e+00, 5.6950e+01, 1.8895e+03],
[2.0000e+00, 1.0000e+00, 1.0000e+00, 5.3850e+01, 1.0815e+02],
...,
[1.1000e+01, 0.0000e+00, 1.0000e+00, 2.9600e+01, 3.4645e+02],
[4.0000e+00, 1.0000e+00, 1.0000e+00, 7.4400e+01, 3.0660e+02],
[6.6000e+01, 1.0000e+00, 1.0000e+00, 1.0565e+02, 6.8445e+03]])
# - categories of the 'Contract' variable
churn_data['Contract'].unique()
array(['Month-to-month', 'One year', 'Two year'], dtype=object)
# - categories of the 'PaymentMethod' variable
churn_data['PaymentMethod'].unique()
array(['Electronic check', 'Mailed check', 'Bank transfer (automatic)',
'Credit card (automatic)'], dtype=object)
# - we want to recreate the previous statsmodels model that was using all the predictors
# - to achieve this we one-hot (dummy) encode non-binary categorical predictors
# - statsmodels chooses the first category in order of appearance in the dataset as the reference category
# - we pass the reference category manually as an argument to the OneHotEncoder
# - note: in scikit-learn >= 1.2 the argument is sparse_output=False instead of sparse=False
enc_contract = OneHotEncoder(drop=['Month-to-month'], sparse=False)
dummy_contract = enc_contract.fit_transform(churn_data['Contract'].values.reshape(-1, 1))
enc_payment = OneHotEncoder(drop=['Bank transfer (automatic)'], sparse=False)
dummy_payment = enc_payment.fit_transform(churn_data['PaymentMethod'].values.reshape(-1, 1))
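As a side note, the same dummy coding can be obtained with pandas: `get_dummies(drop_first=True)` drops the alphabetically first level of each categorical column, which for these two columns happens to coincide with the reference categories passed to OneHotEncoder above. A toy sketch with made-up rows:

```python
import pandas as pd

# - toy frame mimicking the two categorical columns (made-up rows)
toy = pd.DataFrame({
    'Contract': ['Month-to-month', 'One year', 'Two year'],
    'PaymentMethod': ['Electronic check', 'Mailed check',
                      'Bank transfer (automatic)'],
})
# - drop_first=True drops the alphabetically first observed level,
# - matching the reference categories used with OneHotEncoder above
dummies = pd.get_dummies(toy, drop_first=True)
list(dummies.columns)
```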
# - concatenating values of the numerical predictors and encoded binary values with the encoded non-binary values
# - into a feature matrix
X = np.concatenate((X0, dummy_contract, dummy_payment), axis=-1)
display(X)
# - target variable; encoding to binary values
y = churn_data['Churn'].apply(lambda x: int(x == 'Yes'))
array([[ 1., 0., 1., ..., 0., 1., 0.],
[34., 1., 0., ..., 0., 0., 1.],
[ 2., 1., 1., ..., 0., 0., 1.],
...,
[11., 0., 1., ..., 0., 1., 0.],
[ 4., 1., 1., ..., 0., 0., 1.],
[66., 1., 1., ..., 0., 0., 0.]])
### --- Fitting the logistic model to all the data
# - note: in recent scikit-learn versions use penalty=None instead of penalty='none'
log_reg = LogisticRegression(solver='newton-cg', penalty='none')
log_reg.fit(X, y)
LogisticRegression(penalty='none', solver='newton-cg')
# - model's accuracy
round(log_reg.score(X, y), 4)
0.7964
# - exponential of the model parameters
# - ordering corresponds to the ordering of the features in the feature matrix
np.exp(log_reg.coef_), np.exp(log_reg.intercept_)
(array([[0.9411695 , 0.43090698, 1.52652006, 1.0228018 , 1.00028639,
0.39969521, 0.16308371, 0.90426493, 1.51069097, 0.88747494]]),
array([0.42972508]))
# - confusion matrix for the given data
y_pred = log_reg.predict(X)
confusion_matrix(y, y_pred)
array([[4615, 548],
[ 884, 985]])
Say we have observed the following data: $HHTHTTHHHT$. Assume that we know the parameter $p_H$. We can compute the Likelihood function from the following equation:
$\mathcal{L}(p_H|HHTHTTHHHT)$ exactly as we did before. Now, this is the general form of the Binomial Likelihood (where $Y$ stands for the observed data):
$$\mathcal{L}(p|Y) = p_1^y(1-p_1)^{n-y}$$

where $y$ is the number of successes and $n$ the total number of observations. For each observed data point we then have

$$\mathcal{L}(p|y_i) = p_1^{y_i}(1-p_1)^{\bar{y_i}}$$

where ${y_i}$ is the observed value of the outcome $Y$, and $\bar{y_i}$ is its complement (i.e. $1$ for $0$ and $0$ for $1$). This form just determines which value enters the computation of the Likelihood function at each observed data point: either $p_1$ or $1-p_1$. The likelihood function for a given value of $p_1$ for the whole dataset is computed by multiplying the values of $\mathcal{L}(p|y_i)$ across all data points (remember that multiplication in Probability is what conjunction is in Logic and Algebra).
Q: But... how do we get to $p_1$, the parameter value that we will use at each data point? A: We will search the parameter space, of course, $\beta_0, \beta_1, ... \beta_k$ of linear coefficients in our Binary Logistic Model, computing $p_1$ every time, and compute the likelihood function from it! In other words: we will search the parameter space to find the combination of $\beta_0, \beta_1, ... \beta_k$ that produces the maximum of the likelihood function similarly as we have searched the space of linear coefficients to find the combination that minimizes the squared error in Simple Linear Regression.
So what combination of the linear coefficients is the best one?
It is the one which gives the Maximum Likelihood. This approach, known as Maximum Likelihood Estimation (MLE), stands behind many important statistical learning models. It is the cornerstone of Statistical Estimation Theory, and it contrasts with the Least Squares Estimation that we used earlier to estimate Simple and Multiple Linear Regression models.
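To make the coin-toss likelihood concrete: $HHTHTTHHHT$ has $6$ heads in $10$ tosses, so $\mathcal{L}(p_H) = p_H^6(1-p_H)^4$, and a simple grid search over the parameter space finds its maximum at the sample proportion $p_H = 0.6$:

```python
import numpy as np

# - likelihood of 6 heads and 4 tails as a function of p_H
def likelihood(p):
    return p**6 * (1 - p)**4

# - grid search over the parameter space (0, 1)
grid = np.linspace(0.01, 0.99, 99)
p_mle = grid[np.argmax(likelihood(grid))]
# - the maximum likelihood estimate is the sample proportion, 0.6
```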
Now, there is a technical problem related to this approach. To obtain the likelihood for the whole dataset one needs to multiply as many very small numbers as there are data points. That can cause computational problems related to the smallest real numbers that can be represented by digital computers. The workaround is to use the logarithm of likelihood instead, known as Log-Likelihood ($LL$).
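The underflow problem is easy to demonstrate with made-up per-observation likelihoods:

```python
import numpy as np

# - 10,000 made-up per-observation likelihoods around 0.5
probs = np.full(10_000, 0.5)
# - the raw product underflows to exactly 0.0 in float64
raw_product = np.prod(probs)
# - the log-likelihood stays perfectly representable
log_likelihood = np.sum(np.log(probs))
```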
Thus, while the Likelihood function for the whole dataset would be
$$\mathcal{L}(p|Y) = \prod_{i=1}^{n}p_1^{y_i}(1-p_1)^{\bar{y_i}}$$

the Log-Likelihood function would be:
$$LL(p|Y) = \sum_{i=1}^{n} y_i\log(p_1)+\bar{y_i}\log(1-p_1)$$

And finally, here is how we solve the Binomial Logistic Regression problem:
Technically, in optimization we would not go exactly for the maximum of the Likelihood function, because we use $LL$ instead of $\mathcal{L}(p|Y)$. The solution is to minimize the negative $LL$, sometimes written simply as $NLL$, the Negative Log-Likelihood function.
model_frame = churn_data[['Churn', 'MonthlyCharges', 'TotalCharges']]
# - encoding 'Churn' values to binary values
model_frame['Churn'] = model_frame['Churn'].apply(lambda x: int(x == 'Yes'))
model_frame
| | Churn | MonthlyCharges | TotalCharges |
|---|---|---|---|
| 0 | 0 | 29.85 | 29.85 |
| 1 | 0 | 56.95 | 1889.50 |
| 2 | 1 | 53.85 | 108.15 |
| 3 | 0 | 42.30 | 1840.75 |
| 4 | 1 | 70.70 | 151.65 |
| ... | ... | ... | ... |
| 7027 | 0 | 84.80 | 1990.50 |
| 7028 | 0 | 103.20 | 7362.90 |
| 7029 | 0 | 29.60 | 346.45 |
| 7030 | 1 | 74.40 | 306.60 |
| 7031 | 0 | 105.65 | 6844.50 |
7032 rows × 3 columns
Implement the model's predictive pass given the parameters; the following blr_predict() function is nothing else than an implementation of the expression $p = \frac{1}{1+e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2)}}$:
def blr_predict(params, data):
    # - grab parameters
    beta_0, beta_1, beta_2 = params
    # - predictors
    x1 = data["MonthlyCharges"]
    x2 = data["TotalCharges"]
    # - linear combination
    lc = beta_0 + beta_1*x1 + beta_2*x2
    # - inverse logit (sigmoid) of the linear combination
    p = 1/(1 + np.exp(-lc))
    return p
Test blr_predict()
test_params = np.array([-2.7989, .0452, -.0006])
predictions = blr_predict(params=test_params, data=model_frame)
predictions
0 0.187309
1 0.204491
2 0.394181
3 0.120110
4 0.575848
...
7027 0.460025
7028 0.072292
7029 0.158578
7030 0.593878
7031 0.106194
Length: 7032, dtype: float64
Now define the Negative Log-Likelihood function:
from scipy.stats import binom
def blr_nll(params, data):
    # - predictive pass
    p = blr_predict(params, data)
    # - joint negative log-likelihood across all observations
    nll = -binom.logpmf(data["Churn"], 1, p).sum()
    return nll
Test blr_nll():
# - test blr_nll()
test_params = np.array([-2.7989, .0452, -.0006])
blr_nll(params=test_params, data=model_frame)
3284.57719221028
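The call to binom.logpmf with $n = 1$ is just the $LL$ formula above in disguise: for a single Bernoulli trial, $log\,pmf = y\,log(p) + (1-y)log(1-p)$. A quick check on hypothetical outcomes and probabilities (the arrays here are synthetic, not the churn data):

```python
import numpy as np
from scipy.stats import binom

# - synthetic 0/1 outcomes and predicted probabilities,
# - just to verify the identity
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)
p = rng.uniform(0.05, 0.95, size=100)

# - NLL written out from the Log-Likelihood formula ...
nll_formula = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
# - ... and via the Binomial pmf with one trial
nll_binom = -binom.logpmf(y, 1, p).sum()

print(np.isclose(nll_formula, nll_binom))  # True
```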
Optimize!
from scipy.optimize import minimize
# - initial (random) parameter values
init_beta_0 = np.random.uniform(low=-3, high=0)
init_beta_1 = np.random.uniform(low=-.05, high=.05)
init_beta_2 = np.random.uniform(low=-.001, high=.001)
# - x0 must be a flat, 1-D array of parameters
init_pars = np.array([init_beta_0, init_beta_1, init_beta_2])
# - optimize w. Nelder-Mead
optimal_model = minimize(
    # - fun(parameters, args)
    fun=blr_nll,
    args=(model_frame,),
    x0=init_pars,
    method='Nelder-Mead',
    options={'maxiter': 1e6,
             'maxfev': 1e6,
             'fatol': 1e-6})
# - optimal parameters
for param in optimal_model.x:
    print(param)
-2.7989469189152993
0.045231320322325644
-0.0006181477896686361
optimal_model.fun
3283.3939159027886
Check against statsmodels
predictors = model_frame.columns.drop('Churn')
formula = ' + '.join(predictors)
formula = 'Churn ~ ' + formula
print(formula)
binomial_linear_model = smf.logit(formula=formula, data=model_frame).fit()
binomial_linear_model.summary()
Churn ~ MonthlyCharges + TotalCharges
Optimization terminated successfully.
Current function value: 0.466922
Iterations 6
| Dep. Variable: | Churn | No. Observations: | 7032 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 7029 |
| Method: | MLE | Df Model: | 2 |
| Date: | Fri, 28 Apr 2023 | Pseudo R-squ.: | 0.1936 |
| Time: | 23:40:00 | Log-Likelihood: | -3283.4 |
| converged: | True | LL-Null: | -4071.7 |
| Covariance Type: | nonrobust | LLR p-value: | 0.000 |
| | coef | std err | z | P>|z| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -2.7989 | 0.089 | -31.548 | 0.000 | -2.973 | -2.625 |
| MonthlyCharges | 0.0452 | 0.001 | 31.620 | 0.000 | 0.042 | 0.048 |
| TotalCharges | -0.0006 | 1.97e-05 | -31.373 | 0.000 | -0.001 | -0.001 |
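The estimated coefficients live on the log-odds scale; exponentiating them gives the multiplicative change in the odds of churn per unit increase of a predictor. A quick sketch using the values reported in the summary above:

```python
import numpy as np

# - coefficients reported by statsmodels above (log-odds scale)
coefs = {"MonthlyCharges": 0.0452, "TotalCharges": -0.0006}
for name, b in coefs.items():
    # - odds ratio: multiplicative change in the odds of churn
    # - per unit increase of the predictor
    print(f"{name}: odds ratio = {np.exp(b):.4f}")
```

An odds ratio above 1 increases the odds of churn; below 1 decreases them.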
Plot the Negative Log-Likelihood Function
# - from statsmodels: beta_0 is -2.7989, beta_1 is .0452, and beta_2 is -.0006
beta_0 = -2.7989
beta_1_vals = np.linspace(-.05,.05,100)
beta_2_vals = np.linspace(-.001,.001,100)
grid = np.array([(beta_1, beta_2) for beta_1 in beta_1_vals for beta_2 in beta_2_vals])
grid = pd.DataFrame(grid)
grid = grid.rename(columns={0: "beta_1", 1: "beta_2"})
nll = []
for i in range(grid.shape[0]):
    pars = [beta_0, grid['beta_1'][i], grid['beta_2'][i]]
    nll.append(blr_nll(pars, model_frame))
grid['nll'] = nll
grid.sort_values('nll', ascending=False, inplace=True)
grid.head(100)
| | beta_1 | beta_2 | nll |
|---|---|---|---|
| 9999 | 0.050000 | 0.001000 | 17814.865806 |
| 9998 | 0.050000 | 0.000980 | 17570.648096 |
| 9899 | 0.048990 | 0.001000 | 17562.334042 |
| 9997 | 0.050000 | 0.000960 | 17326.772023 |
| 9898 | 0.048990 | 0.000980 | 17318.646658 |
| ... | ... | ... | ... |
| 9492 | 0.044949 | 0.000859 | 14892.359746 |
| 101 | -0.048990 | -0.000980 | 14878.889700 |
| 4 | -0.050000 | -0.000919 | 14845.861773 |
| 102 | -0.048990 | -0.000960 | 14821.273142 |
| 200 | -0.047980 | -0.001000 | 14796.700924 |
100 rows × 3 columns
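Note that the table above is sorted by descending NLL, so it shows the worst grid points; to read off the grid point closest to the optimum, take the row with the minimal NLL via idxmin. A minimal sketch on a hypothetical miniature grid of the same shape:

```python
import pandas as pd

# - a hypothetical miniature grid, same shape as `grid` above
g = pd.DataFrame({"beta_1": [0.010, 0.045, 0.050],
                  "beta_2": [-0.0010, -0.0006, 0.0010],
                  "nll":    [5000.0, 3290.0, 17800.0]})
# - row with the lowest NLL: the grid point closest to the optimum
best = g.loc[g["nll"].idxmin()]
print(best)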
# - import plotly
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"
# - Mesh3d: Objective Function
fig = go.Figure(data=[go.Mesh3d(
    x=grid['beta_1'],
    y=grid['beta_2'],
    z=grid['nll'],
    color='red',
    opacity=0.50)])
fig.update_layout(scene=dict(
        xaxis_title='Beta_1',
        yaxis_title='Beta_2',
        zaxis_title='NLL'),
    width=700,
    margin=dict(r=20, b=10, l=10, t=10))
fig.show()
The Multinomial Regression model is a powerful classification tool. Consider a problem where some outcome variable can result in more than two discrete outcomes. For example, a customer visiting a webshop can end their visit (a) buying nothing, (b) buying some Product A, or (c) Product B, or (d) Product C, etc. If we have some information about a particular customer's journey through the website (e.g. how much time they spent on particular pages, whether they visited the webshop before, or any other information that customers might have chosen to disclose on sign-up...), we can use it as a set of predictors of customer behavior resulting in any of (a), (b), (c), (d). We do that by means of a simple extension of the Binomial Logistic Regression that is used to solve for dichotomies: enters the Multinomial Regression model.
First, similar to what happens in dummy coding, given a set of $K$ possible outcomes we choose one of them as a baseline. All results of the Multinomial Regression will then be interpreted as effects relative to that baseline outcome category, for example: for a unit increase in predictor $X_1$, what is the change in odds of switching from (a) buying nothing to (b) buying Product A. We are already familiar with this logic, right?
So, consider a set of $K-1$ independent Binary Logistic Models with only one predictor $X$ where the baseline is now referred to as $K$:
$$log\frac{P(Y_i = 1)}{P(Y_i = K)} = \beta_{1,0} + \beta_{1,1}X$$$$log\frac{P(Y_i = 2)}{P(Y_i = K)} = \beta_{2,0} + \beta_{2,1}X$$$$log\frac{P(Y_i = K-1)}{P(Y_i = K)} = \beta_{K-1,0} + \beta_{K-1,1}X$$Obviously, we are introducing a set of new regression coefficients ($\beta_{k,\cdot}$) for each possible value of the outcome $k = 1, 2,.., K-1$. The log-odds are on the LHS while the linear model remains on the RHS.
Now we exponentiate the equations to arrive at the expressions for odds:
$$\frac{P(Y_i = 1)}{P(Y_i = K)} = e^{\beta_{1,0} + \beta_{1,1}X}$$$$\frac{P(Y_i = 2)}{P(Y_i = K)} = e^{\beta_{2,0} + \beta_{2,1}X}$$$$\frac{P(Y_i = K-1)}{P(Y_i = K)} = e^{\beta_{K-1,0} + \beta_{K-1,1}X}$$And solve for $P(Y_i = 1), P(Y_i = 2),.. P(Y_i = K-1)$:
$$P(Y_i = 1) = P(Y_i = K)e^{\beta_{1,0} + \beta_{1,1}X}$$$$P(Y_i = 2) = P(Y_i = K)e^{\beta_{2,0} + \beta_{2,1}X}$$$$P(Y_i = K-1) = P(Y_i = K)e^{\beta_{K-1,0} + \beta_{K-1,1}X}$$From the fact that all probabilities $P(Y_i = 1), P(Y_i = 2), .., P(Y_i = K)$ must sum to one it can be shown that
$$P(Y_i = K) = \frac{1}{1+\sum_{k=1}^{K-1}e^{\beta_{k,0} + \beta_{k,1}X}}$$Because:
(1) $P(Y_i = 1) + P(Y_i = 2) + ... + P(Y_i = K) = 1$, we have
(2) $P(Y_i = K) + P(Y_i = K)e^{\beta_{1,0} + \beta_{1,1}X} + P(Y_i = K)e^{\beta_{2,0} + \beta_{2,1}X} + ... + P(Y_i = K)e^{\beta_{K-1,0} + \beta_{K-1,1}X} = 1$
(3) and then, replacing the linear predictor $\beta_{k,0} + \beta_{k,1}X$ by $l_k$ for $k = 1, 2,.., K-1$, we have
(4) $P(Y_i = K) + P(Y_i = K)e^{l_1} + P(Y_i = K)e^{l_2} + ... + P(Y_i = K)e^{l_{K-1}} = 1$
(5) $P(Y_i = K)[1 + e^{l_1} + e^{l_2} + ... + e^{l_{K-1}}] = 1$ so that
(6) $P(Y_i = K) = \frac{1}{1 + e^{l_1} + e^{l_2} + ... + e^{l_{K-1}}} = \frac{1}{1+\sum_{k=1}^{K-1}e^{\beta_{k,0} + \beta_{k,1}X}}$
It is easy now to derive the expressions for all $K-1$ probabilities of the outcome resulting in a particular class:
$$P(Y_i = 1) = \frac{e^{\beta_{1,0} + \beta_{1,1}X}}{1+\sum_{k=1}^{K-1}e^{\beta_{k,0} + \beta_{k,1}X}}$$$$P(Y_i = 2) = \frac{e^{\beta_{2,0} + \beta_{2,1}X}}{1+\sum_{k=1}^{K-1}e^{\beta_{k,0} + \beta_{k,1}X}}$$$$P(Y_i = K-1) = \frac{e^{\beta_{K-1,0} + \beta_{K-1,1}X}}{1+\sum_{k=1}^{K-1}e^{\beta_{k,0} + \beta_{k,1}X}}$$
# - loading the dataset
# - Kaggle: https://www.kaggle.com/datasets/uciml/iris
# - place it in your _data/ directory
iris_data = pd.read_csv(os.path.join(data_dir, 'Iris.csv'), index_col='Id')
iris_data.head(10)
| | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|
| Id | |||||
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 7 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 8 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 9 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 10 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
# - counting the instances of each category
iris_data['Species'].value_counts()
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: Species, dtype: int64
# - info on the variables
iris_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 1 to 150
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   SepalLengthCm  150 non-null    float64
 1   SepalWidthCm   150 non-null    float64
 2   PetalLengthCm  150 non-null    float64
 3   PetalWidthCm   150 non-null    float64
 4   Species        150 non-null    object
dtypes: float64(4), object(1)
memory usage: 7.0+ KB
We use the Multinomial Logistic Regression model to predict the most probable category for a given observation $\textbf{x}$ with features $(x_1, x_2, \ldots, x_k)$. Assume that our target variable $y$ belongs to one of the categories from the set $\{1, 2, \ldots, C\}$. In MNR we usually select one category as the reference category; let that be category $C$. Then the probability that the target variable $y$ belongs to category $c = 1,\ldots,C-1$ is calculated via
$$P(y = c) = \frac{e^{\beta^{(c)}_1x_1 + \beta^{(c)}_2x_2 + \cdots + \beta^{(c)}_kx_k + \beta^{(c)}_0}}{1+\sum_{j=1}^{C-1}e^{\beta^{(j)}_1x_1 + \beta^{(j)}_2x_2 + \cdots + \beta^{(j)}_kx_k + \beta^{(j)}_0}},$$and the probability that it belongs to the reference category $C$ is
$$P(y = C) = \frac{1}{1+\sum_{j=1}^{C-1}e^{\beta^{(j)}_1x_1 + \beta^{(j)}_2x_2 + \cdots + \beta^{(j)}_kx_k + \beta^{(j)}_0}},$$where $\beta^{(j)}_1, \beta^{(j)}_2, \ldots, \beta^{(j)}_k,\ j=1,\ldots,C-1$ are the model's parameters for the predictors, and $\beta^{(j)}_0$ is the intercept for category $j$.
After calculating all the probabilities $P(y = c),\ c=1,\ldots,C$ we predict the target variable as
$$\hat{y} = \textrm{argmax}_{c=1,\ldots,C}P(y=c).$$The model is estimated by MLE (Maximum Likelihood Estimation). For each category $c$ - except for the reference category $C$, of course - we obtain a set of coefficients. Each model coefficient, in each category, tells us about the $\Delta_{odds}$ in favor of the target category, for a unit change of a predictor, in comparison with the baseline category, given that everything else is kept constant.
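The probability formulas above are easy to verify numerically. A minimal sketch with a hypothetical helper mnr_probs and made-up linear predictors $l_k$ (the reference category has an implicit linear predictor of 0):

```python
import numpy as np

def mnr_probs(lin_preds):
    # - lin_preds: the K-1 linear predictors l_k = beta_{k,0} + beta_{k,1}*x;
    # - the reference category K has an implicit linear predictor of 0
    expd = np.exp(lin_preds)
    denom = 1.0 + expd.sum()
    # - P(y = 1), ..., P(y = K-1), followed by P(y = K)
    return np.append(expd, 1.0) / denom

# - hypothetical linear predictors for K = 3 categories
probs = mnr_probs(np.array([0.5, -1.2]))
print(probs, probs.sum())  # the probabilities sum to 1
print(np.argmax(probs))    # index of the predicted category (the argmax rule)
```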
### --- Preparing the variables
# - feature matrix
X = iris_data.drop(columns='Species')
# - we append a constant column of ones to the feature matrix
X = sm.add_constant(X)
print(X[:10])
# - we impose an ordering on the categories of the target vector; the first category is the reference category
cat_type = pd.CategoricalDtype(categories=["Iris-versicolor", "Iris-virginica", "Iris-setosa"], ordered=True)
y = iris_data['Species'].astype(cat_type)
    const  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
Id
1     1.0            5.1           3.5            1.4           0.2
2     1.0            4.9           3.0            1.4           0.2
3     1.0            4.7           3.2            1.3           0.2
4     1.0            4.6           3.1            1.5           0.2
5     1.0            5.0           3.6            1.4           0.2
6     1.0            5.4           3.9            1.7           0.4
7     1.0            4.6           3.4            1.4           0.3
8     1.0            5.0           3.4            1.5           0.2
9     1.0            4.4           2.9            1.4           0.2
10    1.0            4.9           3.1            1.5           0.1
# - fitting the MNR model to the data; we use the Newton Conjugate Gradient
# - method as the optimizer to compute the model's coefficients
mnr_model = sm.MNLogit(exog=X, endog=y).fit(method='ncg', maxiter=150)
mnr_model.summary()
Optimization terminated successfully.
Current function value: 0.039663
Iterations: 126
Function evaluations: 127
Gradient evaluations: 127
Hessian evaluations: 126
| Dep. Variable: | Species | No. Observations: | 150 |
|---|---|---|---|
| Model: | MNLogit | Df Residuals: | 140 |
| Method: | MLE | Df Model: | 8 |
| Date: | Fri, 28 Apr 2023 | Pseudo R-squ.: | 0.9639 |
| Time: | 23:40:08 | Log-Likelihood: | -5.9494 |
| converged: | True | LL-Null: | -164.79 |
| Covariance Type: | nonrobust | LLR p-value: | 7.056e-64 |
| Species=Iris-virginica | coef | std err | z | P>|z| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -42.6378 | 25.707 | -1.659 | 0.097 | -93.023 | 7.748 |
| SepalLengthCm | -2.4650 | 2.394 | -1.030 | 0.303 | -7.158 | 2.228 |
| SepalWidthCm | -6.6814 | 4.480 | -1.491 | 0.136 | -15.461 | 2.099 |
| PetalLengthCm | 9.4294 | 4.737 | 1.990 | 0.047 | 0.145 | 18.714 |
| PetalWidthCm | 18.2859 | 9.743 | 1.877 | 0.061 | -0.809 | 37.381 |
| Species=Iris-setosa | coef | std err | z | P>|z| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.2477 | 2830.080 | 0.000 | 1.000 | -5545.606 | 5548.102 |
| SepalLengthCm | 2.1595 | 800.555 | 0.003 | 0.998 | -1566.899 | 1571.218 |
| SepalWidthCm | 6.3676 | 394.523 | 0.016 | 0.987 | -766.884 | 779.619 |
| PetalLengthCm | -10.6759 | 516.903 | -0.021 | 0.984 | -1023.788 | 1002.436 |
| PetalWidthCm | -5.4912 | 980.189 | -0.006 | 0.996 | -1926.626 | 1915.644 |
# - confusion matrix for our model and given data; rows/columns are on par with the ordering of categorical variable
mnr_model.pred_table()
array([[49., 1., 0.],
[ 1., 49., 0.],
[ 0., 0., 50.]])
# - accuracy of the model
correct_classes = np.trace(mnr_model.pred_table())
print(f'Correct observations: {correct_classes}')
num_obs = np.sum(mnr_model.pred_table())
print(f'Total observations: {num_obs}')
print(f'The accuracy of our model: {round(correct_classes/num_obs, 4)}')
Correct observations: 148.0
Total observations: 150.0
The accuracy of our model: 0.9867
# - model's prediction of probabilities; columns correspond to the ordering of categorical variable
mnr_model.predict()[:10]
array([[9.20967484e-09, 1.42177555e-35, 9.99999991e-01],
[3.42387722e-07, 2.44370591e-32, 9.99999658e-01],
[5.07406119e-08, 6.06932699e-34, 9.99999949e-01],
[1.00687690e-06, 1.98158417e-31, 9.99998993e-01],
[6.04626221e-09, 6.12291889e-36, 9.99999994e-01],
[2.78376401e-08, 9.29455175e-34, 9.99999972e-01],
[8.87535142e-08, 5.70622243e-33, 9.99999911e-01],
[6.28372828e-08, 6.21641455e-34, 9.99999937e-01],
[1.90536565e-06, 9.09786880e-31, 9.99998095e-01],
[3.04189668e-07, 4.59068099e-33, 9.99999696e-01]])
# - model's prediction of categories; numbers correspond to the ordering of categorical variable
preds = np.argmax(mnr_model.predict(), axis=-1)
preds
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
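The integer codes returned by argmax index into the category ordering fixed by cat_type above. A minimal sketch mapping a few sample codes (hypothetical values) back to species labels:

```python
import numpy as np

# - the ordering imposed on the target variable above
categories = np.array(["Iris-versicolor", "Iris-virginica", "Iris-setosa"])
sample_codes = np.array([2, 2, 0, 1])  # hypothetical predicted codes
print(categories[sample_codes])
```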
# - Step 1: recode categorical target variable to integer values,
# - just in order to be able to run a multiple linear regression on the data:
y_code = y.cat.codes
y_code
Id
1 2
2 2
3 2
4 2
5 2
..
146 1
147 1
148 1
149 1
150 1
Length: 150, dtype: int8
### --- Step 2: produce a Multiple Linear Regression model for the data
ols_model = sm.OLS(exog=X, endog=y_code).fit()
ols_model.summary()
| Dep. Variable: | y | R-squared: | 0.571 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.559 |
| Method: | Least Squares | F-statistic: | 48.17 |
| Date: | Fri, 28 Apr 2023 | Prob (F-statistic): | 1.02e-25 |
| Time: | 23:40:08 | Log-Likelihood: | -119.03 |
| No. Observations: | 150 | AIC: | 248.1 |
| Df Residuals: | 145 | BIC: | 263.1 |
| Df Model: | 4 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -0.4405 | 0.509 | -0.866 | 0.388 | -1.446 | 0.565 |
| SepalLengthCm | 0.0872 | 0.143 | 0.608 | 0.544 | -0.196 | 0.371 |
| SepalWidthCm | 0.6832 | 0.149 | 4.586 | 0.000 | 0.389 | 0.978 |
| PetalLengthCm | -0.4413 | 0.142 | -3.117 | 0.002 | -0.721 | -0.161 |
| PetalWidthCm | 0.4198 | 0.235 | 1.789 | 0.076 | -0.044 | 0.884 |
| Omnibus: | 28.857 | Durbin-Watson: | 0.448 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 7.904 |
| Skew: | -0.222 | Prob(JB): | 0.0192 |
| Kurtosis: | 1.966 | Cond. No. | 91.8 |
# --- Step 3: compute VIFs for the predictors
predictors = iris_data.columns.drop('Species')
predictors
Index(['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'], dtype='object')
# - appending a column of ones to the predictors' data
model_frame_predictors = sm.add_constant(iris_data[predictors])
# - computing VIFs
vifs = [variance_inflation_factor(model_frame_predictors.values, i) for i in range(1, len(predictors)+1)]
vifs = np.array(vifs).reshape(1, -1)
pd.DataFrame(vifs, columns=predictors)
| SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | |
|---|---|---|---|---|
| 0 | 7.103113 | 2.099039 | 31.397292 | 16.141564 |
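Recall what a VIF is: $VIF_j = 1/(1-R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors. A numpy-only sketch on synthetic correlated predictors (x1 and x2 are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.5, size=n)  # strongly correlated with x1

# - regress x1 on x2 (with intercept) by least squares
A = np.column_stack([np.ones(n), x2])
coef, *_ = np.linalg.lstsq(A, x1, rcond=None)
resid = x1 - A @ coef
r2 = 1.0 - resid.var() / x1.var()

vif_x1 = 1.0 / (1.0 - r2)  # VIF_1 = 1 / (1 - R_1^2)
print(round(vif_x1, 2))    # well above 1, signalling multicollinearity
```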
# - import scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
# - Preparing the variables
# - feature matrix
X = iris_data.drop(columns='Species').values
# - target variable
y = iris_data['Species']
N.B. scikit-learn does not implement the reference category automatically!
# - Fitting the logistic model to the numerical data
# - scikit-learn does not implement the reference category automatically
log_reg = LogisticRegression(solver='newton-cg', penalty='none')
log_reg.fit(X, y)
LogisticRegression(penalty='none', solver='newton-cg')
# - coefficients of the model; rows correspond to the order of appearance of categories in the target variable
log_reg.coef_, log_reg.intercept_
(array([[ 4.88690435, 7.84898812, -12.84520362, -6.66561933],
[ -1.21084317, -0.58404658, 1.70789901, -5.81027259],
[ -3.67606118, -7.26494154, 11.13730461, 12.47589191]]),
array([ 2.32984794, 20.15404687, -22.48389481]))
# - model's accuracy
round(log_reg.score(X, y), 4)
0.9867
# - predictions
y_pred = log_reg.predict(X)
y_pred[:10]
array(['Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
'Iris-setosa', 'Iris-setosa'], dtype=object)
# - confusion matrix for the given data
# - rows/columns correspond to the order of appearance of categories in the target variable
confusion_matrix(y, y_pred)
array([[50, 0, 0],
[ 0, 49, 1],
[ 0, 1, 49]])
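The accuracy reported by log_reg.score() can be recovered directly from the confusion matrix as the trace (correct classifications) over the total count:

```python
import numpy as np

# - the confusion matrix produced above
cm = np.array([[50, 0, 0],
               [0, 49, 1],
               [0, 1, 49]])
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 4))  # 0.9867
```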
License: [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt) This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.